Space-efficient Data Structures for String Searching and Retrieval

Author

  • Stephane Durocher
Abstract

Chapter 1: Introduction
  1.1 The Models of Computation
  1.2 Our Contributions
  1.3 Roadmap
Chapter 2: Preliminaries
  2.1 Generalized Suffix Tree
  2.2 Suffix Array
  2.3 Compressed Suffix Arrays
  2.4 Bit Vectors with Rank/Select Support
  2.5 Succinct Representation of Ordinal Trees
  2.6 Document Array
  2.7 Differentially Encoding a Sorted Array
  2.8 String B-tree
Chapter 3: External Memory Data Structures
  3.1 Preliminary: Top-k Framework
  3.2 External Memory Structures
    3.2.1 Breaking Down into Sub-Problems
    3.2.2 Converting Top-k to Threshold via Logarithmic Sketch
    3.2.3 Special Structures for Bounded k
    3.2.4 I/O-Optimal Data Structure via Bootstrapping
  3.3 Adapting to Internal Memory
Chapter 4: Succinct Space Data Structures
  4.1 Related Work
  4.2 Our Data Structure
    4.2.1 The Compressed Data Structure
    4.2.2 Faster Compressed Data Structure
  4.3 Extensions
Chapter 5: Compact Space Data Structures
  5.1 Related Work
  5.2 The Data Structure
  5.3 Storing and Retrieving the Lists top(x, z)
  5.4 Completing the Picture
    5.4.1 Query Answering
    5.4.2 Computing Scores Online
  5.5 Reducing the Time to O(p + k log* k)
Chapter 6: Multipattern Retrieval
  6.1 Handling m > 2 Patterns
Chapter 7: Conclusions
References
Vita

Similar resources

Indexing Textual Information

Information retrieval is the computational discipline that deals with the efficient representation and organization of, and access to, information objects that represent natural language texts (Baeza-Yates & Ribeiro-Neto, 1999; Salton & McGill, 1983; Witten, Moffat, & Bell, 1999). A crucial subproblem in the information retrieval area is the design and implementation of efficient data structures and a...

Upper and Lower Bounds for Text Indexing Data Structures

The main goal of this thesis is to investigate the complexity of a variety of problems related to text indexing and text searching. We present new data structures that can be used as building blocks for full-text indices which occupy minute space (FM-indexes) and wavelet trees. These data structures can also be used to represent labeled trees and posting lists. Labeled trees are applied in XM...

INSTRUCT - Space-Efficient Structure for Indexing and Complete Query Management of String Databases

The tremendous expanse of search engines, dictionary and thesaurus storage, and other text mining applications, combined with the popularity of readily available scanning devices and optical character recognition tools, has necessitated efficient storage, retrieval and management of massive text databases for various modern applications. For such applications, we propose a novel data structure,...

siEDM: an efficient string index and search algorithm for edit distance with moves

Although several self-indexes for highly repetitive text collections exist, developing an index and search algorithm with editing operations remains a challenge. Edit distance with moves (EDM) is a string-to-string distance measure that includes substring moves in addition to ordinal editing operations to turn one string into another. Although the problem of computing EDM is intractable, it has...

Phrase Based Document Retrieving by Combining Suffix Tree index data structure and Boyer-Moore faster string searching algorithm

Phrases have been considered more informative feature terms for improving the effectiveness of document retrieval. This paper proposes an algorithm, Phrase Based Document Retrieval, that retrieves similar documents by combining two existing techniques: the suffix tree index data structure and the Boyer-Moore algorithm, a fast string searching algorithm. The suffix tree is constructed based on E. ...

Publication date: 2014